Reading Data into dataframe

data <- read.csv("insurance.csv")
head(data)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Identifying Response and Predictor Variable and their types

Predictor variables:

Categorical : sex, smoker, region

Continuous : age, bmi, children

Response variable : charges

Fitting a multiple linear regression model on all the variables

lin_model <- lm(charges~., data = data)
summary(lin_model)
## 
## Call:
## lm(formula = charges ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

As observed, sex predictor has a high p-value. So we cannot reject the null hypothesis. This is not a significant predictor.

Fitting a multiple linear regression model on all the variables except ‘sex’

lin_model <- lm(charges~.-sex, data = data)
summary(lin_model)
## 
## Call:
## lm(formula = charges ~ . - sex, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16

As observed, region predictor has a high p-value. So we cannot reject the null hypothesis. This is not a significant predictor. It is only significant for particular values of alpha, hence we decide to remove it.

Fitting a multiple linear regression model on all the variables except ‘sex’ and ‘region’

lin_model2 <- lm(charges ~ (.-sex-region), data = data)
summary(lin_model2)
## 
## Call:
## lm(formula = charges ~ (. - sex - region), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11897.9  -2920.8   -986.6   1392.2  29509.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12102.77     941.98 -12.848  < 2e-16 ***
## age            257.85      11.90  21.675  < 2e-16 ***
## bmi            321.85      27.38  11.756  < 2e-16 ***
## children       473.50     137.79   3.436 0.000608 ***
## smokeryes    23811.40     411.22  57.904  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7489 
## F-statistic: 998.1 on 4 and 1333 DF,  p-value: < 2.2e-16

We observe that all predictors have less p-values and for all the null hypothesis can be rejected. All these predictors are significant for the model. We also observe that with these predictors the model can predict 74.97% of outputs.

Checking Autocorrelation

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
dwtest(lin_model2)
## 
##  Durbin-Watson test
## 
## data:  lin_model2
## DW = 2.0874, p-value = 0.945
## alternative hypothesis: true autocorrelation is greater than 0

DW=2 suggests no autocorrelation

Plotting graphs to observe problems

plot(lin_model2)

From the residual vs fitted graph, we observe that the data shows pattern and is not random, hinting at the presence of heteroscedasticity. From the Q-Q plot, we observe that the data is skewed and not normal. From residuals vs leverage plot, we observe that there are no points that are beyond cook’s distance. So, the model does not have influential points.

Conducting BP Test on model

library(lmtest)
bptest(lin_model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  lin_model2
## BP = 117.16, df = 4, p-value < 2.2e-16

The Breusch-Pagan (BP) test indicates that there is heteroscedasticity in the data. So, we proceed to transform the data.

Log Transformation

log_model <- lm(log(charges) ~ age+bmi+children+smoker,data=data)
summary(log_model)
## 
## Call:
## lm(formula = log(charges) ~ age + bmi + children + smoker, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11340 -0.19883 -0.04688  0.07197  2.07581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.9827766  0.0697227 100.151  < 2e-16 ***
## age         0.0347826  0.0008805  39.502  < 2e-16 ***
## bmi         0.0106096  0.0020264   5.236 1.91e-07 ***
## children    0.1011976  0.0101989   9.922  < 2e-16 ***
## smokeryes   1.5432438  0.0304372  50.703  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4491 on 1333 degrees of freedom
## Multiple R-squared:  0.7622, Adjusted R-squared:  0.7614 
## F-statistic:  1068 on 4 and 1333 DF,  p-value: < 2.2e-16
plot(log_model)

Conducting BP Test on log transformed model

bptest(log_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  log_model
## BP = 80.917, df = 4, p-value < 2.2e-16

This is indicating that even with the log-transform the heteroscedasticity is still present, so we use square transform.

Square Transformation

model_square <- lm((charges^2) ~ age+bmi+children+smoker,data=data)
summary(model_square)
## 
## Call:
## lm(formula = (charges^2) ~ age + bmi + children + smoker, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -707112670 -144270762  -18280529  120504755 2730241564 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -793695547   52779464 -15.038   <2e-16 ***
## age            6834317     666556  10.253   <2e-16 ***
## bmi           20298532    1533972  13.233   <2e-16 ***
## children       8590667    7720482   1.113    0.266    
## smokeryes   1057514060   23040686  45.898   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.4e+08 on 1333 degrees of freedom
## Multiple R-squared:  0.6435, Adjusted R-squared:  0.6425 
## F-statistic: 601.6 on 4 and 1333 DF,  p-value: < 2.2e-16
plot(model_square)

Conducting BP Test on square transformed model

bptest(model_square)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_square
## BP = 218.78, df = 4, p-value < 2.2e-16

This also did not help much with the heteroscedasticity.

Square Root Transformation

model_square_root <- lm((charges^0.5) ~ age+bmi+children+smoker,data=data)
summary(model_square_root)
## 
## Call:
## lm(formula = (charges^0.5) ~ age + bmi + children + smoker, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.836 -11.452  -4.541   3.376 103.942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.87277    3.50841  -0.249    0.804    
## age          1.40495    0.04431  31.709  < 2e-16 ***
## bmi          0.92997    0.10197   9.120  < 2e-16 ***
## children     3.25331    0.51320   6.339 3.15e-10 ***
## smokeryes   90.55542    1.53158  59.125  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.6 on 1333 degrees of freedom
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7762 
## F-statistic:  1160 on 4 and 1333 DF,  p-value: < 2.2e-16
plot(model_square_root)

Conducting BP Test on square root transformed model

bptest(model_square_root)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_square_root
## BP = 29.317, df = 4, p-value = 6.741e-06

Even this does not help with the problem. Another transform that can be used is WLS method.

Performing transformation by Weighted Least Square (WLS) method

weights <- 1 / lm(abs(residuals(lin_model2)) ~ fitted(lin_model2))$fitted.values^2

model_wls <- lm(charges ~ age + bmi + children + smoker, data = data, weights = weights)
summary(model_wls)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = data, 
##     weights = weights)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0860 -0.7195 -0.4814 -0.0409 13.1996 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3934.17     663.76  -5.927 3.92e-09 ***
## age           243.16      10.49  23.183  < 2e-16 ***
## bmi            70.29      22.25   3.158  0.00162 ** 
## children      645.13     117.82   5.476 5.20e-08 ***
## smokeryes   22725.01     777.53  29.227  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.6 on 1333 degrees of freedom
## Multiple R-squared:  0.5538, Adjusted R-squared:  0.5525 
## F-statistic: 413.6 on 4 and 1333 DF,  p-value: < 2.2e-16
plot(model_wls)

Conducting BP Test on WLS transformed model

bptest(model_wls)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_wls
## BP = 1.2726e-05, df = 4, p-value = 1

The bp test for the model after transforming with WLS indicates that there is no heteroscedasticity.

Checking multicollinearity

library(car)
## Loading required package: carData
vif(model_wls)
##      age      bmi children   smoker 
## 1.073255 1.042860 1.033168 1.004907

All values are very close to 1 suggesting that there is no issue of multicollinearity

Interpretation of results

The low p-value of the F-statistic indicates that there is at least one independent variable that helps predict the target variable.

Age, bmi, children, and smoker are all significant features and together give the model an R-squared value of 0.5538. This means that the features together can predict 55.38% of the outputs.

Our model is charges = -3934.17 + 243.16age + 70.29bmi + 645.13children + 22725.01smoker

For a smoker, charges = 243.16age + 70.29bmi + 645.13children + (22725.01-3934.17) For a non smoker, charges = 243.16age + 70.29bmi + 645.13children - 3934.17

For every observation, charges increases by an average of beta1*w for a unit average increase of age, where w is the weight for that observation and all other predictors are fixed

For every observation, charges increases by an average of beta2*w for a unit average increase of bmi, where w is the weight for that observation and all other predictors are fixed

For every observation, charges increases by an average of beta3*w for a unit average increase of children, where w is the weight for that observation and all other predictors are fixed

For every observation, charges is beta4*w more for a smoker than a non-smoker, where w is the weight for that observation and all other predictors are fixed

Interaction term observation

model1<-lm(charges~age+bmi+children+smoker+region+sex+age*bmi+age*children+age*smoker+age*region+age*sex+bmi*children+bmi*smoker+bmi*region+bmi*sex+children*smoker+children*region+children*sex+smoker*region+smoker*sex+region*sex,data=data)
summary(model1)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region + 
##     sex + age * bmi + age * children + age * smoker + age * region + 
##     age * sex + bmi * children + bmi * smoker + bmi * region + 
##     bmi * sex + children * smoker + children * region + children * 
##     sex + smoker * region + smoker * sex + region * sex, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11971.4  -1979.7  -1233.8   -212.7  30056.8 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.682e+03  2.591e+03  -0.649 0.516186    
## age                        1.919e+02  5.292e+01   3.626 0.000298 ***
## bmi                        4.958e+01  8.392e+01   0.591 0.554788    
## children                   9.058e+02  6.729e+02   1.346 0.178485    
## smokeryes                 -2.103e+04  1.937e+03 -10.856  < 2e-16 ***
## regionnorthwest           -1.611e+03  2.294e+03  -0.702 0.482761    
## regionsoutheast            2.941e+03  2.193e+03   1.341 0.180147    
## regionsouthwest           -4.435e+02  2.191e+03  -0.202 0.839615    
## sexmale                   -1.260e+03  1.581e+03  -0.797 0.425510    
## age:bmi                    1.168e+00  1.635e+00   0.714 0.475086    
## age:children              -3.123e+00  8.607e+00  -0.363 0.716839    
## age:smokeryes             -3.251e+00  2.400e+01  -0.135 0.892263    
## age:regionnorthwest        1.975e+01  2.741e+01   0.721 0.471192    
## age:regionsoutheast        5.074e+01  2.749e+01   1.846 0.065153 .  
## age:regionsouthwest        4.764e+01  2.787e+01   1.709 0.087656 .  
## age:sexmale                1.577e+01  1.915e+01   0.823 0.410556    
## bmi:children               4.843e-01  1.916e+01   0.025 0.979840    
## bmi:smokeryes              1.487e+03  5.649e+01  26.330  < 2e-16 ***
## bmi:regionnorthwest       -7.191e+00  7.019e+01  -0.102 0.918420    
## bmi:regionsoutheast       -1.893e+02  6.113e+01  -3.097 0.001995 ** 
## bmi:regionsouthwest       -8.750e+01  6.727e+01  -1.301 0.193584    
## bmi:sexmale                3.576e+00  4.636e+01   0.077 0.938528    
## children:smokeryes        -3.848e+02  2.878e+02  -1.337 0.181483    
## children:regionnorthwest   2.625e+02  3.256e+02   0.806 0.420147    
## children:regionsoutheast  -2.073e+02  3.229e+02  -0.642 0.520963    
## children:regionsouthwest  -3.803e+02  3.100e+02  -1.227 0.220035    
## children:sexmale          -2.456e+02  2.236e+02  -1.098 0.272196    
## smokeryes:regionnorthwest -1.610e+02  9.740e+02  -0.165 0.868728    
## smokeryes:regionsoutheast -1.145e+03  9.273e+02  -1.235 0.217237    
## smokeryes:regionsouthwest  1.016e+03  9.872e+02   1.029 0.303475    
## smokeryes:sexmale         -2.702e+01  6.792e+02  -0.040 0.968275    
## regionnorthwest:sexmale    3.828e+02  7.660e+02   0.500 0.617308    
## regionsoutheast:sexmale    6.368e+02  7.705e+02   0.826 0.408682    
## regionsouthwest:sexmale    2.631e+02  7.702e+02   0.342 0.732700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4835 on 1304 degrees of freedom
## Multiple R-squared:  0.8445, Adjusted R-squared:  0.8406 
## F-statistic: 214.7 on 33 and 1304 DF,  p-value: < 2.2e-16

The only significant synergy is bmi with smoker

Fitting a multiple linear regression model with synergy term

interaction_term_model <- lm(charges ~ age+bmi+children+smoker+bmi*smoker, data = data)
summary(interaction_term_model)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + bmi * 
##     smoker, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14598.6  -1924.4  -1321.4   -465.6  29892.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2729.002    831.270  -3.283  0.00105 ** 
## age              264.948      9.553  27.735  < 2e-16 ***
## bmi                5.656     24.873   0.227  0.82014    
## children         508.924    110.615   4.601 4.61e-06 ***
## smokeryes     -20194.709   1654.505 -12.206  < 2e-16 ***
## bmi:smokeryes   1433.788     52.823  27.143  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4871 on 1332 degrees of freedom
## Multiple R-squared:  0.8388, Adjusted R-squared:  0.8382 
## F-statistic:  1387 on 5 and 1332 DF,  p-value: < 2.2e-16

We observe that the inclusion of synergy term helps the model predict 83.88% of values. This is 28.5% more than the transformed model.

Plotting to observe problems

plot(interaction_term_model)

From the residual vs fitted graph, we observe that the data shows pattern and is not random, hinting at the presence of heteroscedasticity. From the Q-Q plot, we observe that the data is skewed and not normal. From residuals vs leverage plot, we observe that there are no points that are beyond cook’s distance. So, the model does not have influential points.

Conducting BP Test

bptest(interaction_term_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  interaction_term_model
## BP = 7.0967, df = 5, p-value = 0.2136

BP test suggests that there is no heteroscedasticity in this model.

Interpretation of results with synergy

The low p-value of the F-statistic indicates that there is at least one independent variable that helps predict the target variable.

Age, children, smoker, and synergy term are all significant features and together give the model. bmi is not a significant feature as it has a high p-value but we include it due to the inclusion of the synergy term. R-squared value is 0.8388. This means that the features together can predict 83.88% of the outputs.

Our model is charges = -2729.002 + 264.948age + 5.656bmi + 508.924children - 20194.709smoker + 1433.788bmi*smoker

For a smoker, charges = -(2729.002 + 20194.709) + 264.948age + (5.656 + 1433.788)bmi + 508.924children

For a non smoker, charges = -2729.002 + 264.948age + 5.656bmi + 508.924children

Charges increases by an average of beta1 for a unit average increase of age, where all other predictors are fixed

Charges increases by an average of beta2 for a unit average increase of bmi, where all other predictors are fixed

Charges increases by an average of beta3 for a unit average increase of children, where all other predictors are fixed

In case if the person is smoker, the charges decreases by an average of 20194.709 units with an average increase of 1433.788 for unit increase in bmi. This is the change over the base case of non-smoker.